Uno de los esquemas más importantes que han evolucionado gracias a la tecnología es el ambiente financiero. El cual ha facilitado a los usuarios préstamos a sus clientes para diferentes inversiones que puedan requerir los usuarios y generar utilidades mediante este tipo de transacciones.
Los bancos incurren en pérdidas cuando un cliente no paga sus préstamos a tiempo. Debido a esto, cada año, los bancos tienen pérdidas en crores, y esto también impacta en gran medida el crecimiento económico del país. En este hackathon, analizamos varios atributos como el monto financiado, la ubicación, el préstamo, el saldo, etc., para predecir si una persona será un incumplidor de préstamos o no.
Por esta razón, la entidad financiera cuenta con una base de datos que registra la información de los préstamos y el estado de cartera, dónde se conoce si el usuario se encuentra o no en mora.
Objetivos de desarrollo:
Datos: La fuente de datos inicial fue tomada del dataset Bank Loan Defaulter Prediction, no obstante, la información ha sido preparada para trabajar en el modelamiento.
Para tener un mejor detalle sobre el comportamiento de las variables, solicitamos a la organización el diccionario de datos y nos suministró la siguiente información:
| ATRIBUTO | DEFINICIÓN |
|---|---|
| ID | Identificador único de la instancia. |
| Loan Amount | Valor del monto del préstamo. |
| Funded Amount | Valor financiado del préstamo. |
| Funded Amount Investor | Inversor de la canidad financiada. |
| Term | Plazo del préstamo (en meses). |
| Interest Rate | Tasa de interés del préstamo. |
| Grade | Calificación del banco. |
| Employment Duration | Tiempo laborado. |
| Home Ownership | Propietario de la vivienda. |
| Verification Status | Estado de verificación de ingresos por parte del banco. |
| Debit to Income | Relación entre el pago mensual de la dueda, dividida entre los ingresos mensuales autodeclarados. Exlucye la hipoteca. |
| Delinquency - two years | Tiempo reportado como moroso (30+número de días) en los 2 últimos años. |
| Inquires - six months | Número total de consultas en los 6 últimos meses. |
| Open Account | Número de líneas de crédito abiertas en la línea de crédito del representante. |
| Public Record | Número de registros públicos desfavorables. |
| Revolving Balance | Saldo total de crédito rotativo. |
| Revolving Utilities | Cantidad de crédito que está utilizando un representante en relación con Revolving Balance. |
| Total Accounts | Número total de líneas de crédito disponibles en la línea de crédito del representante. |
| Initial List Status | Estado de listado único del préstamo - W (En espera), F (Enviado). |
| Total Received Interest | Intereses totales recibidos hasta la fecha. |
| Total Received Late Fee | Total de tarifas por mora recibidas hasta la fecha. |
| Recoveries | Recuperación bruta posterior a la cancelación de la deuda. |
| Collection Recovery Fee | Tarifa de cobranza posterior a la cancelación de la deuda. |
| Collection 12 months Medical | Cobranzas totales en los últimos 12 meses excluyendo las colecciones médicas. |
| Application Type | Indica si la representación es individual o conjunta. |
| Last week Pay | Indica cuánto tiempo (en semanas) un representante ha pagado EMI después de la inscripción en el lote. |
| Accounts Delinquent | Número de cuentas en las que el representante está en mora. |
| Total Collection Account | Saldo total actual de todas las cuentas. |
| Total Current Balance | Saldo total actual de todas las cuentas. |
| Total Revolving Credit Limit | Límite total de crédito rotativo. |
| Loan Status | Estado de cuenta, 1 indica se encuentra en mora y 0 se encuentra al día. |
Realizar el análisis exploratorio de la información.
Generar un pipeline para el tratamiento de la información.
Diseñar un autoencoder el cuál sea entrenado a partir de la clase normal. Tenga en cuenta que la métrica debe ser el accuracy. La arquitectura debe tener el número de capas que considere adecuado para el contexto de un autoencoder. Realice la interpretación de los resultados obtenidos.
Construir las gráficas de entrenamiento, validación. Debes interpretar los resultados obtenidos para este modelo base.
Realizar la identificación de hiperparámetros, justificando la elección de los valores correspondientes.
Reconstruir los resultados usando el mejor modelo y usando los datos de prueba. Realice el proceso de interpretación de los resultados obtenidos.
Bono: Almacena el joblib de la preparación de los datos y el autoencoder, de esta forma serán evaluados los resultados obtenidos y publicados en un ranking, los 3 mejores grupos con los resultados obtenidos evaluando datos nuevos y evaluando la métrica, obtendrán la bonificación de este taller.
Entregables: Debe cargar a bloque neón la siguiente información:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
import numpy as np
from sklearn.preprocessing import Normalizer, MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
#Tensorflow y keras
import tensorflow as tf
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense, Input
from keras.utils import plot_model
from keras.optimizers import Adam
/Users/mariacatalinaibanezpineres/Desktop/MAESTRIA/2024-10/Machine Learning/Talleres/Taller5/env/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
normal = pd.read_csv("https://raw.githubusercontent.com/fcastellanosp/MINE-4206_202410_AML/main/Talleres/Taller%205/normal.csv")
normal
| ID | Loan Amount | Funded Amount | Funded Amount Investor | Term | Interest Rate | Grade | Employment Duration | Home Ownership | Verification Status | ... | Recoveries | Collection Recovery Fee | Collection 12 months Medical | Application Type | Last week Pay | Accounts Delinquent | Total Collection Amount | Total Current Balance | Total Revolving Credit Limit | Loan Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65087372 | 10000 | 32236 | 12329.36286 | 59 | 11.135007 | B | MORTGAGE | 176346.62670 | Not Verified | ... | 2.498291 | 0.793724 | 0 | INDIVIDUAL | 49 | 0 | 31 | 311301 | 6619 | 0 |
| 1 | 1450153 | 3609 | 11940 | 12191.99692 | 59 | 12.237563 | C | RENT | 39833.92100 | Source Verified | ... | 2.377215 | 0.974821 | 0 | INDIVIDUAL | 109 | 0 | 53 | 182610 | 20885 | 0 |
| 2 | 1969101 | 28276 | 9311 | 21603.22455 | 59 | 12.545884 | F | MORTGAGE | 91506.69105 | Source Verified | ... | 4.316277 | 1.020075 | 0 | INDIVIDUAL | 66 | 0 | 34 | 89801 | 26155 | 0 |
| 3 | 6651430 | 11170 | 6954 | 17877.15585 | 59 | 16.731201 | C | MORTGAGE | 108286.57590 | Source Verified | ... | 0.107020 | 0.749971 | 0 | INDIVIDUAL | 39 | 0 | 40 | 9189 | 60214 | 0 |
| 4 | 14354669 | 16890 | 13226 | 13539.92667 | 59 | 15.008300 | C | MORTGAGE | 44234.82545 | Source Verified | ... | 1294.818751 | 0.368953 | 0 | INDIVIDUAL | 18 | 0 | 430 | 126029 | 22579 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61217 | 7273094 | 11440 | 17672 | 22965.76290 | 59 | 15.025260 | C | RENT | 76128.78634 | Verified | ... | 0.061096 | 0.325564 | 0 | INDIVIDUAL | 151 | 0 | 38 | 859647 | 21468 | 0 |
| 61218 | 35182714 | 8323 | 11046 | 15637.46301 | 59 | 9.972104 | C | RENT | 65491.12817 | Source Verified | ... | 2.015494 | 1.403368 | 0 | INDIVIDUAL | 14 | 0 | 37 | 22692 | 8714 | 0 |
| 61219 | 16435904 | 15897 | 32921 | 12329.45775 | 59 | 19.650943 | A | MORTGAGE | 34813.96985 | Verified | ... | 5.673092 | 1.607093 | 0 | INDIVIDUAL | 137 | 0 | 17 | 176857 | 42330 | 0 |
| 61220 | 5300325 | 16567 | 4975 | 21353.68465 | 59 | 13.169095 | D | OWN | 96938.83564 | Not Verified | ... | 1.157454 | 0.207608 | 0 | INDIVIDUAL | 73 | 0 | 61 | 361339 | 39075 | 0 |
| 61221 | 65443173 | 15353 | 29875 | 14207.44860 | 59 | 16.034631 | B | MORTGAGE | 105123.15580 | Verified | ... | 1.856480 | 0.366386 | 0 | INDIVIDUAL | 54 | 0 | 47 | 196960 | 66060 | 0 |
61222 rows × 31 columns
test = pd.read_csv("https://raw.githubusercontent.com/fcastellanosp/MINE-4206_202410_AML/main/Talleres/Taller%205/test.csv")
test
| ID | Loan Amount | Funded Amount | Funded Amount Investor | Term | Interest Rate | Grade | Employment Duration | Home Ownership | Verification Status | ... | Recoveries | Collection Recovery Fee | Collection 12 months Medical | Application Type | Last week Pay | Accounts Delinquent | Total Collection Amount | Total Current Balance | Total Revolving Credit Limit | Loan Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2567369 | 13778 | 11656 | 26105.702520 | 59 | 13.382068 | A | MORTGAGE | 49537.72103 | Verified | ... | 5.090907 | 1.171816 | 0 | INDIVIDUAL | 53 | 0 | 30 | 305148 | 1604 | 0 |
| 1 | 60690195 | 23891 | 25795 | 13254.362100 | 59 | 11.307417 | D | MORTGAGE | 72171.74951 | Verified | ... | 6.579831 | 0.551025 | 0 | INDIVIDUAL | 60 | 0 | 45 | 151330 | 20090 | 0 |
| 2 | 33860189 | 3500 | 3500 | 3500.000000 | 36 | 10.150000 | B | RENT | 43000.00000 | Not Verified | ... | 0.000000 | 0.000000 | 0 | INDIVIDUAL | 65 | 0 | 0 | 32926 | 8700 | 0 |
| 3 | 24927916 | 16588 | 9495 | 12123.544890 | 59 | 10.857981 | B | RENT | 66775.05376 | Source Verified | ... | 3.165733 | 1.170445 | 0 | INDIVIDUAL | 10 | 0 | 25 | 44232 | 88161 | 0 |
| 4 | 8658608 | 9179 | 34461 | 2120.247381 | 59 | 17.392819 | C | RENT | 105507.03900 | Source Verified | ... | 0.133916 | 0.547749 | 0 | INDIVIDUAL | 116 | 0 | 32 | 17616 | 24383 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28907 | 11791863 | 20531 | 14637 | 20873.672020 | 58 | 8.816036 | C | MORTGAGE | 54832.47671 | Source Verified | ... | 8.240559 | 0.263531 | 0 | INDIVIDUAL | 38 | 0 | 54 | 107593 | 30035 | 0 |
| 28908 | 7781823 | 9048 | 18664 | 8055.725617 | 59 | 8.206810 | C | RENT | 64463.45373 | Source Verified | ... | 3.930009 | 0.883955 | 0 | INDIVIDUAL | 2 | 0 | 28 | 352559 | 1735 | 1 |
| 28909 | 4239688 | 29084 | 31199 | 10128.795360 | 59 | 11.977165 | C | RENT | 68783.92330 | Verified | ... | 1.619850 | 0.906886 | 0 | INDIVIDUAL | 6 | 0 | 59 | 203771 | 3191 | 0 |
| 28910 | 14600261 | 12778 | 21387 | 7615.936459 | 59 | 10.311443 | D | RENT | 154143.34970 | Verified | ... | 0.823018 | 0.105983 | 0 | INDIVIDUAL | 36 | 0 | 38 | 315029 | 1385 | 0 |
| 28911 | 63239770 | 14000 | 14000 | 14000.000000 | 60 | 17.860000 | D | MORTGAGE | 43900.00000 | Verified | ... | 0.000000 | 0.000000 | 0 | INDIVIDUAL | 17 | 0 | 0 | 60654 | 13800 | 0 |
28912 rows × 31 columns
Se realiza el entendimiento de los datos a través de las siguientes etapas:
Los pasos no se realizan en orden.
Se revisa el tipo de dato de cada atributo:
normal.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61222 entries, 0 to 61221 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 61222 non-null int64 1 Loan Amount 61222 non-null int64 2 Funded Amount 61222 non-null int64 3 Funded Amount Investor 61222 non-null float64 4 Term 61222 non-null int64 5 Interest Rate 61222 non-null float64 6 Grade 61222 non-null object 7 Employment Duration 61222 non-null object 8 Home Ownership 61222 non-null float64 9 Verification Status 61222 non-null object 10 Debit to Income 61222 non-null float64 11 Delinquency - two years 61222 non-null int64 12 Inquires - six months 61222 non-null int64 13 Open Account 61222 non-null int64 14 Public Record 61222 non-null int64 15 Revolving Balance 61222 non-null int64 16 Revolving Utilities 61222 non-null float64 17 Total Accounts 61222 non-null int64 18 Initial List Status 61222 non-null object 19 Total Received Interest 61222 non-null float64 20 Total Received Late Fee 61222 non-null float64 21 Recoveries 61222 non-null float64 22 Collection Recovery Fee 61222 non-null float64 23 Collection 12 months Medical 61222 non-null int64 24 Application Type 61222 non-null object 25 Last week Pay 61222 non-null int64 26 Accounts Delinquent 61222 non-null int64 27 Total Collection Amount 61222 non-null int64 28 Total Current Balance 61222 non-null int64 29 Total Revolving Credit Limit 61222 non-null int64 30 Loan Status 61222 non-null int64 dtypes: float64(9), int64(17), object(5) memory usage: 14.5+ MB
Se observa que, al importar los datos, las columnas Grade, Employment Duration, Verification Status, Initial List Status y Application Type se importan como object por lo tanto son categóricas.
Por otro lado, se observa que las columnas ID, Loan Amount, Funded Amount, Term, Delinquency - two years, Inquires - six months, Open Account, Public Record, Revolving Balance, Total Accounts, Collection 12 months Medical, Last week Pay, Accounts Delinquent, Total Collection Amount, Total Current Balance, Total Revolving Credit Limit y Loan Status son de tipo int64. Mientras que Funded Amount Investor, Interest Rate, Home Ownership, Debit to Income, Revolving Utilities, Total Received Interest, Total Received Late Fee, Recoveries y Collection Recovery Fee son de tipo float64.
No obstante, ID es un identificador único por lo que es categórica, se procede entonces a cambiar el tipo de dato de ID a object. Por otra lado, se observa que Loan Status es una variable booleana que toma valores 0 o 1. Por lo que se procede a cambiar el tipo de dato de Loan Status a object, porque es categórica.
Esto aún no corresponde a preparación de datos. Los datos no se modifican, únicamente se cambia el tipo de dato. De lo contrario, esta exploración de datos sería engañosa.
normal['ID'] = normal['ID'].astype('object')
Se muestran a continuación los estadísticos descriptivos de las variables numéricas:
numeric_variables = normal.select_dtypes(include=['number'])
numeric_variables
| Loan Amount | Funded Amount | Funded Amount Investor | Term | Interest Rate | Home Ownership | Debit to Income | Delinquency - two years | Inquires - six months | Open Account | ... | Total Received Late Fee | Recoveries | Collection Recovery Fee | Collection 12 months Medical | Last week Pay | Accounts Delinquent | Total Collection Amount | Total Current Balance | Total Revolving Credit Limit | Loan Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000 | 32236 | 12329.36286 | 59 | 11.135007 | 176346.62670 | 16.284758 | 1 | 0 | 13 | ... | 0.102055 | 2.498291 | 0.793724 | 0 | 49 | 0 | 31 | 311301 | 6619 | 0 |
| 1 | 3609 | 11940 | 12191.99692 | 59 | 12.237563 | 39833.92100 | 15.412409 | 0 | 0 | 12 | ... | 0.036181 | 2.377215 | 0.974821 | 0 | 109 | 0 | 53 | 182610 | 20885 | 0 |
| 2 | 28276 | 9311 | 21603.22455 | 59 | 12.545884 | 91506.69105 | 28.137619 | 0 | 0 | 14 | ... | 18.778660 | 4.316277 | 1.020075 | 0 | 66 | 0 | 34 | 89801 | 26155 | 0 |
| 3 | 11170 | 6954 | 17877.15585 | 59 | 16.731201 | 108286.57590 | 18.043730 | 1 | 0 | 7 | ... | 0.044131 | 0.107020 | 0.749971 | 0 | 39 | 0 | 40 | 9189 | 60214 | 0 |
| 4 | 16890 | 13226 | 13539.92667 | 59 | 15.008300 | 44234.82545 | 17.209886 | 1 | 3 | 13 | ... | 19.306646 | 1294.818751 | 0.368953 | 0 | 18 | 0 | 430 | 126029 | 22579 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61217 | 11440 | 17672 | 22965.76290 | 59 | 15.025260 | 76128.78634 | 21.929698 | 0 | 0 | 8 | ... | 0.010722 | 0.061096 | 0.325564 | 0 | 151 | 0 | 38 | 859647 | 21468 | 0 |
| 61218 | 8323 | 11046 | 15637.46301 | 59 | 9.972104 | 65491.12817 | 17.694279 | 0 | 0 | 12 | ... | 0.027095 | 2.015494 | 1.403368 | 0 | 14 | 0 | 37 | 22692 | 8714 | 0 |
| 61219 | 15897 | 32921 | 12329.45775 | 59 | 19.650943 | 34813.96985 | 10.295774 | 0 | 0 | 7 | ... | 0.028212 | 5.673092 | 1.607093 | 0 | 137 | 0 | 17 | 176857 | 42330 | 0 |
| 61220 | 16567 | 4975 | 21353.68465 | 59 | 13.169095 | 96938.83564 | 7.614624 | 0 | 0 | 14 | ... | 0.074508 | 1.157454 | 0.207608 | 0 | 73 | 0 | 61 | 361339 | 39075 | 0 |
| 61221 | 15353 | 29875 | 14207.44860 | 59 | 16.034631 | 105123.15580 | 16.052112 | 0 | 0 | 30 | ... | 0.000671 | 1.856480 | 0.366386 | 0 | 54 | 0 | 47 | 196960 | 66060 | 0 |
61222 rows × 25 columns
numeric_variables.describe()
| Loan Amount | Funded Amount | Funded Amount Investor | Term | Interest Rate | Home Ownership | Debit to Income | Delinquency - two years | Inquires - six months | Open Account | ... | Total Received Late Fee | Recoveries | Collection Recovery Fee | Collection 12 months Medical | Last week Pay | Accounts Delinquent | Total Collection Amount | Total Current Balance | Total Revolving Credit Limit | Loan Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 61222.000000 | 61222.00000 | 61222.000000 | 61222.000000 | 61222.000000 | 61222.000000 | 61222.000000 | 61222.000000 | 61222.000000 | 61222.000000 | ... | 61222.000000 | 61222.000000 | 61222.000000 | 61222.000000 | 61222.000000 | 61222.0 | 61222.000000 | 6.122200e+04 | 61222.000000 | 61222.0 |
| mean | 16860.853092 | 15767.04910 | 14621.996411 | 58.177436 | 11.842815 | 80488.081484 | 23.307491 | 0.324573 | 0.145667 | 14.280618 | ... | 1.128288 | 59.765850 | 1.129406 | 0.021332 | 71.078665 | 0.0 | 144.592238 | 1.591377e+05 | 23113.297703 | 0.0 |
| std | 8357.381374 | 8154.68281 | 6776.671183 | 3.303851 | 3.711378 | 44982.239350 | 8.458977 | 0.793192 | 0.472828 | 6.226048 | ... | 5.206573 | 357.229252 | 3.509474 | 0.144490 | 43.324943 | 0.0 | 733.697102 | 1.387621e+05 | 20939.928901 | 0.0 |
| min | 1014.000000 | 1014.00000 | 1114.590204 | 36.000000 | 5.320006 | 14573.537170 | 0.675299 | 0.000000 | 0.000000 | 2.000000 | ... | 0.000003 | 0.000036 | 0.000036 | 0.000000 | 0.000000 | 0.0 | 1.000000 | 6.170000e+02 | 1000.000000 | 0.0 |
| 25% | 10042.250000 | 9255.00000 | 9845.684836 | 58.000000 | 9.298548 | 51678.569317 | 16.760928 | 0.000000 | 0.000000 | 10.000000 | ... | 0.021131 | 1.626781 | 0.478150 | 0.000000 | 35.000000 | 0.0 | 24.000000 | 5.025250e+04 | 8139.250000 | 0.0 |
| 50% | 16087.000000 | 13032.00000 | 12806.409390 | 59.000000 | 11.376376 | 69274.739920 | 22.652777 | 0.000000 | 0.000000 | 13.000000 | ... | 0.043496 | 3.336222 | 0.782344 | 0.000000 | 68.000000 | 0.0 | 35.000000 | 1.179930e+05 | 16712.500000 | 0.0 |
| 75% | 22098.000000 | 21801.00000 | 17748.224160 | 59.000000 | 14.189447 | 94458.038567 | 30.069674 | 0.000000 | 0.000000 | 16.000000 | ... | 0.071949 | 5.448862 | 1.072563 | 0.000000 | 104.000000 | 0.0 | 46.000000 | 2.278930e+05 | 32125.750000 | 0.0 |
| max | 35000.000000 | 34999.00000 | 34999.746430 | 59.000000 | 27.182348 | 406561.536400 | 39.629862 | 8.000000 | 5.000000 | 37.000000 | ... | 42.618882 | 4354.467419 | 166.833000 | 1.000000 | 161.000000 | 0.0 | 16421.000000 | 1.177412e+06 | 201169.000000 | 0.0 |
8 rows × 25 columns
fig = plt.figure(figsize=(20, 10))
ax = sns.boxplot(data= numeric_variables)
plt.title('Boxplot de las variables numéricas')
plt.tight_layout()
Insights
Se revisan las categorías y la frecuencia de ocurrencia de cada categoría para las variables categóricas:
categoric_variables = normal.select_dtypes(include=['object'])
categoric_variables
| ID | Grade | Employment Duration | Verification Status | Initial List Status | Application Type | |
|---|---|---|---|---|---|---|
| 0 | 65087372 | B | MORTGAGE | Not Verified | w | INDIVIDUAL |
| 1 | 1450153 | C | RENT | Source Verified | f | INDIVIDUAL |
| 2 | 1969101 | F | MORTGAGE | Source Verified | w | INDIVIDUAL |
| 3 | 6651430 | C | MORTGAGE | Source Verified | w | INDIVIDUAL |
| 4 | 14354669 | C | MORTGAGE | Source Verified | w | INDIVIDUAL |
| ... | ... | ... | ... | ... | ... | ... |
| 61217 | 7273094 | C | RENT | Verified | w | INDIVIDUAL |
| 61218 | 35182714 | C | RENT | Source Verified | w | INDIVIDUAL |
| 61219 | 16435904 | A | MORTGAGE | Verified | w | INDIVIDUAL |
| 61220 | 5300325 | D | OWN | Not Verified | f | INDIVIDUAL |
| 61221 | 65443173 | B | MORTGAGE | Verified | f | INDIVIDUAL |
61222 rows × 6 columns
text_cols_str = [text_col for text_col in normal.select_dtypes(include=['object']).columns]
normal[text_cols_str].describe()
| ID | Grade | Employment Duration | Verification Status | Initial List Status | Application Type | |
|---|---|---|---|---|---|---|
| count | 61222 | 61222 | 61222 | 61222 | 61222 | 61222 |
| unique | 61222 | 7 | 3 | 3 | 2 | 2 |
| top | 65087372 | C | MORTGAGE | Source Verified | w | INDIVIDUAL |
| freq | 1 | 17293 | 33128 | 29951 | 33072 | 61110 |
Verificamos que los valores de cada columna categórica esten dentro de las opciones posibles.
categoric_variables["Grade"].unique()
array(['B', 'C', 'F', 'A', 'G', 'E', 'D'], dtype=object)
categoric_variables["Employment Duration"].unique()
array(['MORTGAGE', 'RENT', 'OWN'], dtype=object)
categoric_variables["Verification Status"].unique()
array(['Not Verified', 'Source Verified', 'Verified'], dtype=object)
categoric_variables["Initial List Status"].unique()
array(['w', 'f'], dtype=object)
categoric_variables["Application Type"].unique()
array(['INDIVIDUAL', 'JOINT'], dtype=object)
Verificamos el porcentaje de distribución de las categorías de las variables categóricas:
categoric_variables
| ID | Grade | Employment Duration | Verification Status | Initial List Status | Application Type | |
|---|---|---|---|---|---|---|
| 0 | 65087372 | B | MORTGAGE | Not Verified | w | INDIVIDUAL |
| 1 | 1450153 | C | RENT | Source Verified | f | INDIVIDUAL |
| 2 | 1969101 | F | MORTGAGE | Source Verified | w | INDIVIDUAL |
| 3 | 6651430 | C | MORTGAGE | Source Verified | w | INDIVIDUAL |
| 4 | 14354669 | C | MORTGAGE | Source Verified | w | INDIVIDUAL |
| ... | ... | ... | ... | ... | ... | ... |
| 61217 | 7273094 | C | RENT | Verified | w | INDIVIDUAL |
| 61218 | 35182714 | C | RENT | Source Verified | w | INDIVIDUAL |
| 61219 | 16435904 | A | MORTGAGE | Verified | w | INDIVIDUAL |
| 61220 | 5300325 | D | OWN | Not Verified | f | INDIVIDUAL |
| 61221 | 65443173 | B | MORTGAGE | Verified | f | INDIVIDUAL |
61222 rows × 6 columns
categorical_variables = categoric_variables[["Initial List Status", "Application Type"]]
for col in categorical_variables:
print(f"Categorías y frecuencias para {col}")
print(normal[col].value_counts(normalize=True))
Categorías y frecuencias para Initial List Status Initial List Status w 0.540198 f 0.459802 Name: proportion, dtype: float64 Categorías y frecuencias para Application Type Application Type INDIVIDUAL 0.998171 JOINT 0.001829 Name: proportion, dtype: float64
categorical_variables = categoric_variables[["Grade", "Employment Duration", "Verification Status"]]
for col in categorical_variables:
print(f"Categorías y frecuencias para {col}")
print(normal[col].value_counts(normalize=True))
Categorías y frecuencias para Grade Grade C 0.282464 B 0.279426 A 0.178939 D 0.121901 E 0.095195 F 0.032880 G 0.009196 Name: proportion, dtype: float64 Categorías y frecuencias para Employment Duration Employment Duration MORTGAGE 0.541113 RENT 0.356718 OWN 0.102169 Name: proportion, dtype: float64 Categorías y frecuencias para Verification Status Verification Status Source Verified 0.489220 Verified 0.268335 Not Verified 0.242446 Name: proportion, dtype: float64
Verificamos si existen valores nulos en el conjunto de datos.
pd.isna(normal).sum()
ID 0 Loan Amount 0 Funded Amount 0 Funded Amount Investor 0 Term 0 Interest Rate 0 Grade 0 Employment Duration 0 Home Ownership 0 Verification Status 0 Debit to Income 0 Delinquency - two years 0 Inquires - six months 0 Open Account 0 Public Record 0 Revolving Balance 0 Revolving Utilities 0 Total Accounts 0 Initial List Status 0 Total Received Interest 0 Total Received Late Fee 0 Recoveries 0 Collection Recovery Fee 0 Collection 12 months Medical 0 Application Type 0 Last week Pay 0 Accounts Delinquent 0 Total Collection Amount 0 Total Current Balance 0 Total Revolving Credit Limit 0 Loan Status 0 dtype: int64
Verificamos si existen valores duplicados en el conjunto de datos.
normal.duplicated().sum()
0
Se genera un reporte completo, en formato HTML, para concluir con el proceso de perfilamiento:
profiling = ProfileReport(normal.copy())
profiling.to_file("profiling.html")
/Users/mariacatalinaibanezpineres/Desktop/MAESTRIA/2024-10/Machine Learning/Talleres/Taller5/env/lib/python3.11/site-packages/ydata_profiling/utils/dataframe.py:137: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df.rename(columns={"index": "df_index"}, inplace=True)
Summarize dataset: 87%|████████▋ | 33/38 [00:00<00:00, 25.89it/s, Calculate auto correlation] /Users/mariacatalinaibanezpineres/Desktop/MAESTRIA/2024-10/Machine Learning/Talleres/Taller5/env/lib/python3.11/site-packages/ydata_profiling/model/correlations.py:66: UserWarning: There was an attempt to calculate the auto correlation, but this failed.
To hide this warning, disable the calculation
(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'INDIVIDUAL'')
warnings.warn(
Summarize dataset: 100%|██████████| 440/440 [00:19<00:00, 22.23it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:03<00:00, 3.38s/it]
Render HTML: 100%|██████████| 1/1 [00:03<00:00, 3.38s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 65.39it/s]
Adicionalmente, se decide graficar la distribucion entre las variables numéricas:
plt.figure(figsize=(25, 20))
train_num = normal[list(numeric_variables.columns)]
sns.pairplot(train_num, diag_kind='kde')
plt.suptitle('Pairplot de las variables numéricas', size = 20)
plt.show()
<Figure size 2500x2000 with 0 Axes>